A 140-Mb/s, 32-State, Radix-4 Viterbi Decoder

Peter J. Black, Student Member, IEEE, and Teresa H. Meng, Member, IEEE



Abstract: A 140-Mb/s, 32-state, radix-4, R = 1/2, eight-level
soft-decision Viterbi decoder has been designed and fabricated
using 1.2-μm double-metal CMOS. The architecture of the
add-compare-select (ACS) array is based on a restructuring of
the conventional radix-2 trellis into a radix-4 trellis. Radix-4
units, consisting of four 4-way ACS units, process two stages of
the constituent radix-2 trellis per iteration. A 4-way ACS
circuit is used that achieves an iteration delay 17% longer than
comparable two-way ACS circuits, resulting in a factor of 1.7
increase in throughput. A ring-based ACS placement and state
metric routing topology is described for the radix-4 ACS array,
which achieves an area efficiency comparable to radix-2 designs.
In a process referred to as pretrace-back, one stage of
lookahead is applied to the trace-back recursion, combining two
radix-4 trace-back iterations into a single radix-16 iteration
based on 4-b decisions. This allows implementation of trace-back
using one compact, single-ported decision memory, organized as a
cyclic buffer. The 7.30-mm × 8.49-mm chip containing 146 000
transistors achieves a radix-4 iteration rate of 70 MHz,
equivalent to a decode rate of 140 Mb/s under typical operating
conditions (V_DD = 5.0 V, 27°C).



I. Introduction 

In recent years there has been interest in implementa-
tions of the Viterbi algorithm at rates on the order of
100 Mb/s. Driving applications include convolutional 
decoders for error correction, trellis code demodulation 
for communication channels, and digital sequence detec- 
tion for magnetic storage devices. An important problem 
found in such applications is the decoding of a binary shift 
register (BSR) trellis. The classical high throughput im- 
plementation for such decoders is the radix-2 fully par- 
allel approach, where add-compare-select (ACS) units are 
assigned to each state and organized in pairs to iterate one 
stage of a two-state trellis. The decode rate of this ap- 
proach is fundamentally limited by either the recursive 
ACS iteration or the recursive trace-back iteration. In 
commercial decoder designs, implementing the trace-back 
portion of the algorithm in an area-efficient manner often 
results in the trace-back recursion being the limiting crit- 
ical path. To date, such single chip implementations have 
achieved decode rates around 25 Mb/s [1], [2].

The most common technique for increasing decoder 
throughput is to run multiple decoders in parallel on either 
interleaved data [3] or blocked data [4], [5]. The resulting
architecture achieves a speedup that is directly propor-
tional to the additional complexity and is referred to as
ideal linear scaling. Using decoder level parallelism to 
meet a throughput specification results in an overall area 
efficiency approximately equal to that of the core decoder.

An alternative approach to high throughput design is to 
exploit new forms of concurrency within a decoder by re-
formulation of the algorithm [6]. However, given the op-
tion of decoder-level parallelism, such architectures
should achieve an area efficiency comparable to existing 
designs. Otherwise, the increased throughput could have 
been matched in less area by simply running existing de-
coders in parallel. To date, the most area-efficient designs
are based on the classical fully parallel radix-2 architec- 
ture. In this paper we show that the throughput of this 
architecture can be extended by applying one stage of 
lookahead to both the ACS and trace-back recursions. The 
result is an ACS iteration based on a radix-4 trellis and 
trace-back iteration based on a radix-16 trellis. This ar-
chitecture is demonstrated in a 32-state, R = 1/2 Viterbi
decoder that achieves a decode rate of 140 Mb/s. 

A brief synopsis of this work has been presented in [7]. 
In Section II the Viterbi algorithm and higher radix for-
mulations for a BSR trellis are described. Section III dis-
cusses the decoder specification and implementation de-
tails of all major functional blocks. In Section IV the ACS
placement and the state metric interconnect routing strat- 
egy for a radix-4 trellis are addressed. Finally, in Section 
V fabrication results are presented. 

II. Viterbi Algorithm
The Viterbi algorithm is an optimum algorithm for es- 
timating the state sequence of a finite state process given 
a set of noisy observations. An important class of problem 
is the BSR process, the state of which can be described 
by the contents of a binary shift register. An R = 1/2
convolutional encoder is based on a binary shift register. 
For each input bit, two output bits are generated as a mod-
ulo 2 combination of the shift register contents and the
input. Given a corrupted encoded sequence, the decoder 
uses the Viterbi algorithm to estimate the state sequence 
of the encoder, from which the original input sequence is
decoded. 

This section is an overview of the Viterbi algorithm for 
a BSR process of memory length $\nu$. We adopt the con-
vention that the state of the process is given by the shift
register contents interpreted as a radix-2 number, with the 
MSB corresponding to the oldest sample. A complete dis- 
cussion of the Viterbi algorithm can be found in [8]. 
A. Decoding a Radix-2 Trellis

The Viterbi algorithm is typically expressed in terms of 
a trellis diagram, which is a time indexed version of the 
state diagram. The simplest trellis is shown in Fig. 1(a) 
for a two-state BSR process. At time n - 1 there are two 
possible states, each of which can transition to either of 
the two next states at time n depending on the binary in-
put. Each transition is weighted according to its likeli-
hood based on the noisy observations of the process. The
Viterbi algorithm determines the most likely state se- 
quence by finding the minimum weight path, or shortest 
path, through the trellis. 

Associated with each trellis state $S$ at time $n$ is a state
metric $\Gamma_n^S$, which is the accumulated metric along the
shortest path leading to that state, and a decision $d_n^S$, which
identifies the entering transition on the shortest path. For
a $2^\nu$-state BSR process it is convenient to reorganize the
radix-2 trellis from $n - 1$ to $n$ into 2-state radix-2
subtrellises of the form shown in Fig. 1(a). Each state
transition is identified by the previous state $ix$ and the next
state $xj$, where $i$ is the bit shifted out of the shift register,
$j$ is the next input bit, and $x$ represents the common state
bits. The likelihood of this transition is given by the
branch metric $\lambda_n^{ixj}$. Given the state metrics of the two
predecessor states at time $n - 1$, the state metric for state
$x0$ at time $n$ is given by

$$\Gamma_n^{x0} = \min\left(\Gamma_{n-1}^{0x} + \lambda_n^{0x0},\; \Gamma_{n-1}^{1x} + \lambda_n^{1x0}\right) \quad (1)$$

and the decision is given by the state bit (i) shifted from 
the predecessor state that yielded the minimum updated 
metric. This recursive update is the well-known ACS op- 
eration and is implemented using a 2-way ACS unit as 
shown in Fig. 1(b). Since the two output states of the sub- 
trellis have common predecessor states, it is logical to 
combine two 2-way ACS units into a radix-2 unit, which 
updates the state metrics for the radix-2 subtrellis as
shown in Fig. 1(c). 
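
As an illustration only, the ACS update described above can be sketched in Python. The function names, the `lam[i][j]` layout for the branch metric of the transition ix to xj, and the tie-breaking rule (the predecessor 0x wins on equality) are assumptions made for this sketch and are not taken from the design.

```python
def acs2(gamma_0x, gamma_1x, lam_0, lam_1):
    """2-way add-compare-select: return (updated metric, decision bit i)."""
    cand0 = gamma_0x + lam_0  # path entering from predecessor state 0x
    cand1 = gamma_1x + lam_1  # path entering from predecessor state 1x
    if cand0 <= cand1:        # assumed tie-break: keep predecessor 0x
        return cand0, 0
    return cand1, 1


def radix2_unit(gamma_0x, gamma_1x, lam):
    """Update both successors x0 and x1 of one two-state subtrellis.

    lam[i][j] is the branch metric of the transition ix -> xj.
    """
    return [acs2(gamma_0x, gamma_1x, lam[0][j], lam[1][j]) for j in (0, 1)]


# Example: predecessor metrics 3 and 1 with four arbitrary branch metrics.
print(radix2_unit(3, 1, [[2, 5], [6, 1]]))
```

The two ACS operations share the same pair of predecessor metrics, which is why pairing them into one radix-2 unit is natural.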

The input sequence is decoded using a recursive pro-
cedure known as trace-back. Given an arbitrary starting
state $S_n$ and its decision $d_n^{S_n}$, the previous state is given by
$S_{n-1} = d_n^{S_n}(S_n \gg 1)$, which corresponds to a 1-b right
shift of the current shift register state with input equal to
the current state decision. For example, given $S_n = 00111$
and $d_n^{S_n} = 1$, the previous state is $S_{n-1} = 10011$. This
recursion proceeds to a depth L known as the survivor 
path length, at which point all the shortest paths (survivor 
paths) from all possible starting states should have 
merged, and the input corresponding to the transition from 
the state at time index n - L is decoded. 
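
A minimal sketch of this recursion, assuming the state is held as an integer of `nu` bits and that `decisions[t][s]` holds the decision bit stored for state `s` at time `t` (both are conventions chosen for the example, not details from the paper):

```python
def traceback_step(state, decision, nu):
    """One radix-2 trace-back step: 1-b right shift with the decision
    bit entering as the new MSB."""
    return (decision << (nu - 1)) | (state >> 1)


def traceback(start_state, decisions, nu, depth):
    """Walk back `depth` steps from the newest decision vector."""
    state = start_state
    last = len(decisions) - 1
    for t in range(last, last - depth, -1):
        state = traceback_step(state, decisions[t][state], nu)
    return state


# The example from the text: S_n = 00111 with decision 1 gives 10011.
print(format(traceback_step(0b00111, 1, 5), "05b"))
```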

B. Radix-4 Trellis ACS Update 

Using the unified approach to state metric update [9],
a $2^\nu$-state trellis can be iterated from time index $n - k$ to
$n$ by decomposing the trellis into $2^{\nu-k}$ subtrellises, each
consisting of $k$ iterations of a $2^k$-state trellis. Each $2^k$-state
subtrellis can be collapsed into an equivalent one-stage
radix-$2^k$ trellis by applying $k$ levels of lookahead to the
recursive ACS update [10], [11]. Collapsing the trellis
does not affect decoder performance since there is a one-
to-one mapping between the shortest path in the collapsed
trellis and the original radix-2 trellis; hence the decoded
paths are identical. An example of the decomposition for
an eight-state radix-2 trellis into an equivalent radix-4
trellis using one stage of lookahead is shown in Fig. 2.

Fig. 1. (a) Radix-2 trellis. (b) 2-way ACS. (c) Radix-2 ACS unit.

The iteration time of a state-parallel Viterbi decoder is
limited to the iteration time of the ACS. For a radix-$2^k$
implementation, the effective iteration time is $1/k$ times
the delay through a $2^k$-way ACS, since $k$ iterations of the
original radix-2 trellis are processed per iteration. If the
delays through a $2^k$-way and a 2-way ACS are equal, a po-
tential $k$-fold speedup is achievable for a complexity in-
crease of $2^{k-1}$ in the ideal case.

A summary of the potential speedup and complexity as 
a function of trellis radix is given in Table I. In practice 
the ideal speedups cannot be achieved due to the addi- 
tional delay overhead associated with the exponentially 
increasing number of inputs to the compare. The radix-4 
architecture is of special interest because it corresponds 
to the intersection of the exponentially increasing com- 
plexity curve and the ideal linear speedup curve. As a re-
sult, the radix-4 architecture was chosen because it is the
only architecture to offer a throughput increase while
maintaining area efficiency.
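
The trade-off can be tabulated with a few lines of Python. This is a sketch under the ideal-case assumptions stated above (speedup $k$, complexity increase $2^{k-1}$); treating area efficiency as the ratio of the two is an interpretation that reproduces the values of Table I, and the function name is hypothetical.

```python
def radix_tradeoff(k):
    """Ideal-case figures for a radix-2**k ACS implementation."""
    radix = 2 ** k
    ideal_speedup = k               # k radix-2 stages per iteration
    complexity = 2 ** (k - 1)       # ideal complexity increase
    efficiency = ideal_speedup / complexity
    return radix, ideal_speedup, complexity, efficiency


for k in range(1, 5):
    print(radix_tradeoff(k))
```

Radix-4 is the largest radix for which the ratio stays at 1, which is the intersection point referred to in the text.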

Fig. 2. (a) 8-state radix-2 trellis. (b) 4-state subtrellis decomposition. (c)
8-state radix-4 trellis.

TABLE I
Radix-$2^k$ Complexity Speed Measures

Radix   k   Ideal Speedup   Complexity Increase   Area Efficiency
  2     1         1                  1                  1
  4     2         2                  2                  1
  8     3         3                  4                  0.75
 16     4         4                  8                  0.5

The state metric labelling for a radix-4 trellis is shown
in Fig. 3(a). Since each output state has four predecessor
states, state metrics are updated using a 4-way ACS
unit as shown in Fig. 3(b) and decisions consist of 2 b.
Each radix-4 subtrellis is updated using four 4-way ACS
units, which are combined to form a radix-4 ACS unit as
shown in Fig. 4.

C. Radix-16 Trellis Trace-Back Update

One stage of lookahead can be applied to the trace-back 
recursion to ease the timing constraints on this potential 
critical path. For a radix-4 trellis the trace-back iterations
from $n$ to $n - 2$ and from $n - 2$ to $n - 4$ can be expressed
as

$$S_{n-2} = d_n^{S_n}(S_n \gg 2)$$
$$S_{n-4} = d_{n-2}^{S_{n-2}}(S_{n-2} \gg 2). \quad (2)$$

Applying one stage of lookahead, these radix-4 trace-back
iterations can be combined into a single radix-16 iteration
from $n$ to $n - 4$ as follows:

$$\begin{aligned}
S_{n-4} &= d_{n-2}^{S_{n-2}}\left(d_n^{S_n}(S_n \gg 2) \gg 2\right) \\
        &= d_{n-2}^{S_{n-2}}\, d_n^{S_n}(S_n \gg 4) \\
        &= d_{n,n-2}^{S_n}(S_n \gg 4)
\end{aligned} \quad (3)$$

where $d_{n,n-2}^{S_n}$ is the composite 4-b radix-16 decision. For
example, given the current state decision is $d_n^{S_n} = 01$ and
the decision of the previous state as selected by this
decision is $d_{n-2}^{S_{n-2}} = 11$, the composite decision is
$d_{n,n-2}^{S_n} = 1101$. The composite decisions are calculated
prior to running the trace-back algorithm and hence must be
calculated for all states. We refer to the calculation of the
composite decisions as pretrace-back.

Fig. 3. (a) Radix-4 trellis. (b) 4-way ACS.

Fig. 4. Radix-4 ACS unit.

Formation of the pretraced decision for a state requires
both the current state decision and the decisions of all pos-
sible previous states. The current state decision defines
the previous state and hence can be used to select the pre-
vious state decision. The selected previous decision and
the current decision are then combined to form the com-
posite 4-b radix-16 decision. This pretrace-back operation
is implemented simply using a multiplexer as shown in
Fig. 5(a). We refer to this logical unit as a 4-way pre-
trace-back unit. Since the four output states of a radix-4
trellis have the same four previous states, a radix-4 pre-
trace-back unit consisting of four 4-way pretrace-back
units logically maps to a radix-4 subtrellis as shown in
Fig. 5(b).
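
The pretrace-back operation can be sketched in software as follows. This is an illustrative model, not the multiplexer circuit of Fig. 5: it assumes a 5-b integer state with the decision bits entering at the MSB end, and it indexes the predecessor directly where the hardware would use a 4:1 multiplexer. All names are hypothetical.

```python
NU = 5           # state bits of a 32-state decoder
N = 1 << NU


def prev_radix4(state, d2):
    """Radix-4 trace-back step: S_{n-2} = d (S_n >> 2), d is a 2-b decision."""
    return (d2 << (NU - 2)) | (state >> 2)


def prev_radix16(state, d4):
    """Radix-16 trace-back step: S_{n-4} = d (S_n >> 4), d is a 4-b decision."""
    return (d4 << (NU - 4)) | (state >> 4)


def pretrace(dec_cur, dec_prev):
    """Composite 4-b decision for every state.

    dec_cur[s] and dec_prev[s] are the 2-b decisions of state s at
    times n and n-2 respectively.
    """
    composite = []
    for s in range(N):
        d_n = dec_cur[s]
        p = prev_radix4(s, d_n)                     # current decision selects the predecessor
        composite.append((dec_prev[p] << 2) | d_n)  # d_{n-2} followed by d_n
    return composite


# The example from the text: d_n = 01 and d_{n-2} = 11 give 1101.
print(format(pretrace([0b01] * N, [0b11] * N)[0], "04b"))
```

One radix-16 step with the composite decision lands on the same state as two successive radix-4 steps, which is the property that lets the decision memory be read half as often.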

Fig. 5. (a) 4-way pretrace-back unit. (b) Radix-4 pretrace-back unit.

The size of the decision memory is unchanged by the
use of pretrace-back since 4-b decisions are stored for each
state for every other ACS iteration, as opposed to 2-b de-
cisions per state per iteration. Using radix-16 based trace-
back, 4 b are decoded per iteration at one quarter the rate
of conventional radix-2 implementations.

D. Modulo Arithmetic for ACS

The recursive state metric update results in unbounded
word growth due to the addition of branch metrics, which
are always nonnegative. We avoid normalization using the
modulo arithmetic approach proposed in [12]. Modulo
arithmetic exploits the fact that the Viterbi algorithm in-
herently bounds the maximum dynamic range $\Delta_{\max}$ of state
metrics to be

$$\Delta_{\max} \le \lambda_{\max} \log_2 N \quad (4)$$

where $N$ is the number of states and $\lambda_{\max}$ is the maximum
branch metric for the original radix-2 trellis [13]. Given
two numbers $a$ and $b$ such that $|a - b| < \Delta$, which are
to be compared using subtraction, a result from number
theory states that the comparison can be evaluated as
$(a - b) \bmod 2\Delta$ without ambiguity. Hence, the state metrics
can be updated and compared modulo $2\Delta_{\max}$. By appro-
priately choosing the state metric precision, the modulo
arithmetic is implicitly implemented by ignoring the state
metric overflow.

The required state metric precision is equal to twice the
maximum dynamic range of the updated state metrics just
before the compare stage and the number of bits required
is given by

$$\text{number of bits} = \left\lceil \log_2\left(\Delta_{\max} + k\lambda_{\max}\right) \right\rceil + 1. \quad (5)$$

The term $k\lambda_{\max}$ accounts for the potential dynamic range
increase from the input of the radix-$2^k$ ACS to the input
of the compare stage due to the branch metric addition.
The equivalent of overflow errors will result if the design
value for the dynamic range is exceeded during operation;
hence, the upper bound given in (4) is used to ensure cor-
rect operation under all SNR conditions. For the 32-state
radix-4 decoder implemented, $k = 2$, $\lambda_{\max} = 14$, $\Delta_{\max} = 70$,
and the required state metric precision is 8 b.
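
The arithmetic can be checked with a short sketch. It assumes the precision is the number of bits needed to represent twice the range $\Delta_{\max} + k\lambda_{\max}$, as the text describes, and it models the overflow-ignoring comparison by inspecting the sign bit of the wrapped difference. The function names are hypothetical.

```python
import math


def metric_bits(n_states, lam_max, k):
    """Bits needed for twice the pre-compare dynamic range."""
    delta_max = lam_max * math.log2(n_states)   # bound (4)
    return math.ceil(math.log2(delta_max + k * lam_max)) + 1


def mod_less_than(a, b, bits):
    """True if a < b, where a and b are stored modulo 2**bits and the
    true difference is smaller than 2**(bits - 1) in magnitude."""
    diff = (a - b) & ((1 << bits) - 1)
    return diff >= (1 << (bits - 1))            # sign bit of the wrapped difference


print(metric_bits(32, 14, 2))                   # 32-state radix-4 decoder

# True metrics 250 and 260; the second has overflowed an 8-b register.
print(mod_less_than(250 & 0xFF, 260 & 0xFF, 8))
```

The second call shows why no normalization is needed: the register holding 260 has wrapped to 4, yet the comparison still orders the two metrics correctly.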

III. Decoder Implementation 

A. Specification

The specifications for the Viterbi decoder implemented 
are: 

• K = 6 (32 states), R = 1/2 convolutional decoder,

• Generator polynomials G1 = 111011 and G2 =
110001 (coding gain of 4.8 dB at 10"^ BER),

• 8-level soft decision inputs,

• survivor path length 32 (slightly greater than 5 times
the constraint length [14]).

A block diagram of the complete decoder is shown in 
Fig. 6. In the following subsections, implementation de- 
tails of all major blocks are described. 

B. Branch Metric Unit 

Branch metrics for the radix-4 trellis are generated by
combining branch metrics of successive iterations of the
underlying radix-2 trellis. For each iteration of the radix-
2 trellis, two 3-b soft decision inputs $G_2G_1$ are combined
to form four possible branch metrics $\lambda_n(G_2G_1)$ corre-
sponding to the four possible encoder outputs $(G_2G_1) \in
\{(00), (01), (10), (11)\}$. Branch metrics are generated us-
ing a uniform distance measure equal to the symbol itself
when compared to a logic zero and its one's complement 
when compared to a logic one [14]. Although not imple- 
mented, the radix-4 architecture can support punctured 
codes by simply forming the radix-2 metrics as in a nor- 
mal radix-2 punctured decoder. A block diagram of the 
radix-2 branch metric unit is shown in Fig. 7(a). 

The 16 possible radix-4 branch metrics for iteration
$n - 2$ to $n$ are formed from the Cartesian product set
$\lambda_{n-1}(G_2G_1) \times \lambda_n(G_2G_1)$ of the radix-2 metrics for iter-
ations $n - 1$ and $n$. In this design the radix-4 metrics are
centrally calculated, then globally distributed to the ACS
units. However, since the number of radix-4 metrics is
the square of the number of radix-2 metrics, in some cases
(e.g., $R = 1/4$ code) it is more area efficient to globally
distribute the radix-2 metrics and locally calculate the ra-
dix-4 metrics as required in the ACS units. The block dia-
gram of the radix-4 branch metric unit is shown in Fig.
7(b).

Fig. 7. (a) Radix-2 branch metric unit. (b) Radix-4 branch metric unit.

The complete branch metric unit (BMU) consists of two
radix-2 metric units operating in parallel for iterations
$n - 1$ and $n$, the outputs of which are combined in the
radix-4 metric unit. Appropriate pipeline stages have been
added to ensure the metric generation is not the limiting
critical path. Given 3-b soft decision inputs, 4 b of pre-
cision are required for the radix-2 metrics and 5 b for the
radix-4 metrics.
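
A software sketch of the metric generation follows, assuming 3-b symbols in the range 0 to 7 and using dictionaries keyed by the hypothesised encoder output bits. The data layout and names are chosen for the example and do not reflect the pipelined hardware.

```python
from itertools import product

SOFT_MAX = 7  # largest 3-b soft decision value


def radix2_metrics(g2, g1):
    """Four radix-2 branch metrics keyed by hypothesised output bits (b2, b1)."""
    def dist(sym, bit):
        # symbol itself against a logic zero, its one's complement against a one
        return sym if bit == 0 else SOFT_MAX - sym
    return {(b2, b1): dist(g2, b2) + dist(g1, b1)
            for b2, b1 in product((0, 1), repeat=2)}


def radix4_metrics(m_prev, m_cur):
    """Sixteen radix-4 metrics: every pairing of the radix-2 metrics
    from iterations n-1 and n."""
    return {(a, b): m_prev[a] + m_cur[b] for a, b in product(m_prev, m_cur)}


m = radix2_metrics(7, 0)       # worst-case symbols
r4 = radix4_metrics(m, m)
print(len(r4), max(m.values()), max(r4.values()))
```

The largest radix-2 metric is 14 and the largest radix-4 metric is 28, consistent with the 4-b and 5-b precisions quoted above.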

C. Radix-4 ACS Unit 

In order to achieve the potential twofold increase in 
throughput offered by the radix-4 architecture, the 4-way
ACS delay must be equal to that of a 2-way ACS. Fast, 
area-efficient 2-way ACS units are designed using ripple 
carry arithmetic for the add and compare operations. 
Faster adder structures offer little speed advantage at the 
required precision of 8 b and do not warrant the additional 
area overhead, especially when the combined delay of the 
add and compare are considered. By implementing the
compare using subtraction, the adder and subtractor carry
chains run in parallel from LSB to MSB, resulting in an
add-compare delay that is only one full adder bit delay
longer than the 8-b ripple carry add delay alone.

Fig. 8. 4-way ACS block diagram.

The block diagram of the 4-way ACS used to achieve 
comparable delay to the reference 2-way ACS is shown 
in Fig. 8. The 4-way compare is evaluated by generating 
the six possible pairwise comparisons and combining the 
results in two levels of logic to form the minimum metric 
selection. Only the subtractor carry chains were imple- 
mented to generate the sign of results and hence the com- 
parison. The adder carry chain circuit is based on the con- 
ventional six-transistor generate-propagate (GP) style 
static logic gate shown in Fig. 9(a). For the subtractors, 
the overhead of GP generation logic was eliminated using 
the equivalent ten-transistor compound gate shown in Fig.
9(b). This structure is slower than the adder circuit; how-
ever, since the subtractor carry chains are unloaded,
equalization of the two carry chain critical paths was 
achieved. 

Compared to the benchmark 2-way ACS, the delay 
through the 4-way ACS is increased due to three factors: 
the increased fan-out of the adder outputs, the additional 
logic in generating the multiplexer control signals from 
pairwise comparisons, and the additional delay of a 4:1
multiplexer compared to a 2:1 multiplexer. Overall the
4-way ACS delay is 17% longer than the 2-way ACS of
a similar structure, resulting in a speedup factor of 1.7 in 
our final design. 
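
The parallel compare can be modelled as below. This is a behavioural sketch only: it uses plain integer comparisons in place of the subtractor carry chains, and the tie-breaking rule (lowest index wins) is an assumption.

```python
from itertools import combinations


def four_way_select(cands):
    """Index of the minimum of four candidate metrics, found from the
    six pairwise comparisons evaluated independently."""
    # less[(i, j)] is True when candidate i <= candidate j, for i < j
    less = {(i, j): cands[i] <= cands[j] for i, j in combinations(range(4), 2)}
    for i in range(4):
        # candidate i is selected if it wins all three of its comparisons
        if all(less[(i, j)] if i < j else not less[(j, i)]
               for j in range(4) if j != i):
            return i
    raise AssertionError("unreachable: exactly one candidate wins")


print(four_way_select([9, 4, 7, 4]))

# Two radix-2 stages per iteration at 1.17 times the 2-way ACS delay.
print(round(2 / 1.17, 2))
```

Because all six comparisons are available at once, the selection logic adds only a shallow combinational stage after the carry chains, instead of the two sequential compares a tree of 2-way units would need.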

D. State Metric Initialization

Forcing a known starting state by appropriate initial- 
ization of state metrics is useful for block decoding and 
self-test. In order for the dynamic range bound given in (4)
to hold, the initial state metrics must satisfy the con- 
straints imposed by the trellis structure and the update al-
gorithm. For example, state-0 and state-1 have a common
ancestor state one iteration back that constrains the state 
metrics to differ at most by the maximum branch metric. 
Fig. 9. (a) Adder carry chain circuit. (b) Subtractor carry chain circuit.

Valid starting metrics were found by simulation of the 
decoder with all zero inputs until steady-state metrics were 
reached. At this point state-0 is the current and most likely 
state and all other metrics have reached a maximum value 
under the constraints imposed by the trellis. The initial 
state metrics for the 32-state decoder are given in Table
II.

E. Trace-back Unit 

In order to match the throughput of the trace-back unit 
to that of the ACS array, implementation of the trace-back 
algorithm [15] described in Section II-A requires $L/k$
trace-back recursions per ACS iteration, where $L$ is the
survivor path length and decisions are based on a radix-
$2^k$ trellis. For the 32-state radix-4 decoder implemented,
$L = 32$, $k = 2$, and the required trace-back recursion rate
(TRR) is 16 per ACS iteration. For typical implementa- 
tions the recursion rate is limited to less than two. 

The block-based architecture in [16] achieves a TRR of 
one using four independent single-ported decision mem- 
ories, each of depth L. This architecture has been gener- 
alized into a class of multiple read-pointer architectures 
(equivalent to multiple trace-back units) [17]. Using six 
independent memories, the total memory depth can be re-
duced to 3L (L/2 per memory). By increasing the radix
of the trellis using pretraced decisions as described in Sec- 
tion II-C, we are able to implement the simplest block- 
based architecture using one single-ported decision mem- 
ory of size 3L operating at a TRR of one. 

The decision memory is organized as a cyclic buffer 
and is conceptually partitioned into a read and write re- 
gion as shown in Fig. 10. During each block decode 
phase, new pretraced decisions are written to the write 
region on even clock cycles, while the survivor path is 
traced and decoded from the read region on odd clock
cycles (a clock is equivalent to a single radix-4 ACS it-
eration). Trace-back of the read region proceeds to a depth
L, the final state of which is used to initialize the trace-
back and decode of the remaining block of length D. Once
a block is decoded its decisions can be discarded and it
becomes the write region of the next phase.

Fig. 10. Decision memory organization.

Given the constraint that the block write and trace-back 
operations must complete in the same time interval, it can 
be easily shown that the required TRR is given by 

$$\text{TRR} = \frac{1}{2}\left(1 + \frac{L}{D}\right). \quad (6)$$

For each set of read vectors, the trace-back recursions 
have two cycles to complete, thus restricting the TRR to 
be a multiple of 1/2. Values of D that are of practical 
interest are L and L/2, corresponding to total decision
memory depths of 3L and 2L, respectively. Based on area
estimates for the decision memory and periphery logic, 
the 3L architecture was 20% smaller than the 2L. This 
apparent paradox is due to the fact that the additional pe- 
ripheral logic required to support 1.5 trace-back recur- 
sions per cycle more than offset the area gains achieved 
by reducing the decision memory depth. Given the smaller 
area and slower TRR of one, the 3L architecture was cho- 
sen for implementation. 
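
The two design points can be compared numerically. The sketch takes (6) to read TRR = (1 + L/D)/2, which reproduces the rates of 1 and 1.5 quoted in the text; the memory depth expression L + 2D is an inference from the two quoted depths (3L and 2L) rather than a formula stated in the paper.

```python
from fractions import Fraction


def trr(L, D):
    """Trace-back recursions required per cycle for block length D."""
    return Fraction(1, 2) * (1 + Fraction(L, D))


def memory_depth(L, D):
    """Assumed total decision memory depth (inferred, see lead-in)."""
    return L + 2 * D


L = 32
for D in (L, L // 2):
    print(D, trr(L, D), memory_depth(L, D))
```

With D = L the rate is exactly one recursion per cycle at a depth of 3L; halving D saves a third of the memory but raises the rate to 1.5, which is what demands the extra peripheral logic.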

The block diagram of the complete decision memory 
and trace-back unit is shown in Fig. 11. Pretraced deci- 
sions from the ACS array are stored as 32-state by 4-b 
vectors in the decision memory prior to trace-back. Given 
a survivor path length of 32 for the radix-2 trellis, the 
effective radix-16 survivor path length is equal to eight
vectors, and the total memory size is 24 vectors. Each 
decision memory word contains two vectors, which al-
lows a double vector read. Decision vector writes alter-
nate between the left and right halves of the memory. Read
cycles fetch two successive decision vectors in parallel, 
which are then processed in the trace-back unit. 

Each trace-back recursion consists of a 32-way selec-
tion from the read decision vector based on the current
5-b state. The selected 4-b decision and the current state
are combined according to (3) to form the next state. The 
timing of the trace-back critical path is such that the up- 
dated state is available before the end of the next read 
cycle. This is exploited by completing one level of the 
32-way selection during the read. The MSB and LSB of 
the current state are used as column address bits, selecting 
even or odd states from the left and right halves of the 
decision memory. This reduces the trace-back recursion 
selection to 1 of 16 states using a 16:1 multiplexer.

By combining the cyclic buffer memory management
scheme and pretraced decisions, we achieved a trace-back
throughput matched to the ACS array in an area-efficient
manner using a compact single-ported decision memory.

F. Decoder Output LIFO 

The block based trace-back algorithm naturally pro- 
duces a decode sequence which is time reversed within 
each block. Correct temporal ordering of the decoder out- 
put is achieved using two last-in first-out (LIFO) buffers 
that double buffer each decoded block. 
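
The reordering amounts to a per-block reversal. A minimal behavioural sketch, assuming the decoded bits arrive as a flat list and that the block length divides the stream (the double buffering itself, where one LIFO fills while the other drains, is not modelled):

```python
def reorder_blocks(reversed_stream, block_len):
    """Restore temporal order of a stream that is time reversed
    within each block of length block_len."""
    out = []
    for start in range(0, len(reversed_stream), block_len):
        block = reversed_stream[start:start + block_len]
        out.extend(reversed(block))   # last in, first out
    return out


# Two blocks of four symbols, each produced newest first.
print(reorder_blocks([3, 2, 1, 0, 7, 6, 5, 4], 4))
```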

G. Self-Test Unit 

The convolutional encoder was also implemented on the 
same chip for completeness and to facilitate an extensive 
on-chip self-test. In self-test mode, the encoder input is 
switched to a pseudorandom sequence generator. The bi-
nary encoder output symbols are converted to 3-b soft
decisions for input to the decoder. The mapping is chosen
to ensure error-free decoding; hence, a delayed version of 
the original encoder input can be compared to the decoder 
output to verify functionality. The soft-decision mapping 
chosen was 0 → 011 and 1 → 100. This noisy mapping
was chosen over the obvious noiseless bit extension be- 
cause it ensures nonzero branch metrics and state metric 
growth. 

The length of the pseudorandom sequence generator 
was chosen to ensure extensive test coverage of the de-
cision memory. When operating correctly in self-test
mode, it can be asserted that each decoded 4-b decision
is both written and read correctly. The 20-tap pseudo-
random sequence generator ensures that the decoded state se-
quence addresses every 4-b decision location with all pos-
sible bit patterns.

H. Clocking and I/O 

A single-phase clocking strategy was chosen to sim- 
plify clock distribution while maximizing system 
throughput. All latch stages are implemented using rising 
edge true-single-phase circuits [18]. The global clock is 
distributed up the center of the chip, from which local 
module clocks are derived using two-stage buffers of 
identical delay. 

To avoid running the external I/O at the internal clock 
rate of 70 MHz, the input stream is parallelized to run at 
half the internal clock rate. For consistency a half-rate 
clock is supplied to the chip and internally doubled in rate. 
At the external clock rate of 35 MHz, four input symbol 
pairs G1 G2 are input per cycle and four decoded bits are
output per cycle, corresponding to a throughput of 140 
Mb/s. 

IV. Radix-4 ACS Placement and Routing 
In fully parallel decoders the state metric interconnect 
represents a significant portion of the ACS array area. The 
radix-4 trellis interconnect, with four inputs per ACS, ap-
pears as though it might be twice as complex as a radix-
2 trellis. However, by arranging the 4-way ACS units
into radix-4 units as shown in Fig. 4, and considering the 
ACS array as an interconnect of such radix-4 units, the 
number of global state metric transfers is equal to the 
number of states, which is the same as that achieved in 
radix-2 arrays. 

An area-efficient topology for radix-2 arrays is a ring
of radix-2 units arranged in two columns [19]. Each node 
has an output that connects directly to an adjacent unit, 
while the other output connects to a central global state 
metric routing channel. Extending this topology to the ra- 
dix-4 case results in 75% of the global state metric con- 
nections being resolved in the central route compared to 
50% for the radix-2 case. The following describes a mod- 
ification of the ring topology for a radix-4 array, which is 
both area efficient and regular in routing structure. In fact, 
the regularity was such that the ACS array layout was pro- 
duced entirely using a tile based generator, without the 
need for a channel router. 

A. Radix-4 ACS Ring Topology

The global state metric interconnect can be described 
by a directed graph. Vertices correspond to radix-4 ACS 
units and edges correspond to the global state metric 
transfers. An example of the interconnect graph for a 16- 
state radix-4 trellis is shown in Fig. 12(a). 
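
The interconnect graph can be generated programmatically. The sketch below assumes the state convention of Section II (new bits enter at the LSB), so that the radix-4 unit indexed by the common bits x updates states 4x to 4x + 3 and reads the states whose low-order bits equal x. The unit numbering and the brute-force search are illustrative choices, not the placement procedure used for the chip.

```python
from itertools import permutations


def radix4_unit_graph(n_states):
    """Directed edges (source unit, destination unit), one per global
    state metric transfer."""
    units = n_states // 4
    edges = []
    for x in range(units):
        for j in range(4):
            out_state = 4 * x + j                  # state updated by unit x
            edges.append((x, out_state % units))   # unit that reads it next
    return edges


def hamilton_cycle(n_states):
    """Brute-force search for a ring ordering of the radix-4 units in
    which every unit feeds its successor."""
    units = n_states // 4
    edge_set = set(radix4_unit_graph(n_states))
    for rest in permutations(range(1, units)):
        ring = (0,) + rest
        if all((ring[i], ring[(i + 1) % units]) in edge_set
               for i in range(units)):
            return ring
    return None


print(len(radix4_unit_graph(16)), hamilton_cycle(16))
print(len(radix4_unit_graph(32)), hamilton_cycle(32))
```

The edge count equals the number of states in both cases, matching the observation that grouping ACS units into radix-4 units keeps the number of global transfers at the radix-2 level.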

Radix-4 units are organized into a ring, such that the 
sequence of units around the ring corresponds to a Ham- 
ilton cycle in the interconnect graph. This guarantees that 
each unit has an output that connects to the adjacent unit.

[...] topology is chosen from a search space consisting [...]

[...]pate 1.8 W at V_DD = 5.0 V and 27°C. This corre-
sponds to a radix-4 iteration rate of 70 MHz, equivalent
to a radix-2 decode rate of 140 Mb/s. 

VI. Summary 

We have shown that the throughput of a radix-2 state-
parallel Viterbi decoder can be increased using higher ra-
dix formulations of the trellis. The radix-4 architecture is
the radix of choice because it is the only architecture to 
offer ideal linear scaling and a practically achievable 
throughput increase. Throughput doubling relies on im- 
plementing a 4-way ACS operation at the same rate as a 
2-way ACS. A 4-way ACS circuit has been presented that 
achieves this goal to within 17%, resulting in an overall 
speedup of a factor of 1.7. 

A preprocessing stage has been added to the trace-back 
portion of the Viterbi algorithm, which further collapses 
the radix-4 trellis into a radix-16 structure. Using pre-
traced decisions, implementation of the trace-back algo-
rithm is possible without fragmenting the decision mem-
ory into multiple memories and without running multiple
trace-back recursions in parallel. The radix-16 trace-back
architecture is implemented using a single-ported decision 
memory of depth equal to three times the survivor path 
length. 

The state metric and branch metric routing of a radix- 
4 ACS array represents a significant proportion of the ar- 
ray area. An area-efficient ring-based radix-4 ACS place- 
ment and routing topology has been proposed, resulting 
in a routing overhead comparable to radix-2 designs. 

These concepts are demonstrated in a 32-state R = 1/2
Viterbi decoder, implemented using 1.2-μm CMOS tech-
nology. A radix-4 iteration rate of 70 MHz corresponding
to an effective radix-2 decode rate of 140 Mb/s was 
achieved. This represents a significant increase over cur- 
rent commercial radix-2 designs of equivalent complexity 
that have been limited to a decode rate of 25 Mb/s. 